Grammatical Bigrams
Abstract
Unsupervised learning algorithms have been derived for several statistical models of English grammar, but their computational complexity makes applying them to large data sets intractable. This paper presents a probabilistic model of English grammar that is much simpler than conventional models, but which admits an efficient EM training algorithm. The model is based upon grammatical bigrams, i.e., syntactic relationships between pairs of words. We present the results of experiments that quantify the representational adequacy of the grammatical bigram model, its ability to generalize from labelled data, and its ability to induce syntactic structure from large amounts of raw text.
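The abstract does not spell out the model's parameterization. As a rough illustration only, the sketch below assumes that a parse is a set of directed (head, dependent) word pairs and that each pair is scored by a conditional probability P(dependent | head), so a parse's probability is the product over its links. The table `LINK_PROB`, its values, the `<root>` symbol, and the function `parse_log_prob` are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy table of P(dependent | head); the values are invented
# for illustration and are not estimates from any corpus.
LINK_PROB = {
    ("<root>", "sat"): 0.5,
    ("sat", "cat"): 0.25,
    ("cat", "the"): 0.5,
}


def parse_log_prob(links, link_prob):
    """Return the sum of log P(dependent | head) over the links of one parse.

    Assumption: links are treated as independent given their heads, so the
    parse probability factorizes into one term per word pair.
    """
    total = 0.0
    for head, dependent in links:
        p = link_prob.get((head, dependent), 0.0)
        if p == 0.0:
            # A pair with zero probability rules out the whole parse.
            return float("-inf")
        total += math.log(p)
    return total


# One candidate parse of "the cat sat", written as (head, dependent) pairs.
links = [("<root>", "sat"), ("sat", "cat"), ("cat", "the")]
score = parse_log_prob(links, LINK_PROB)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once sentences get long.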
Similar resources
Synthetic Grammar Learning: Implicit Rule Abstraction or Explicit Fragmentary Knowledge?
Three experiments were designed to demonstrate that classifying new letter strings as grammatical (i.e., conforming to a set of rules called a synthetic grammar) or ungrammatical may proceed from fragmentary conscious knowledge of the bigrams constituting the grammatical strings displayed in the study phase, rather than from an unconscious structured representation of the grammar, as Reber (1989) c...
Author identification in short texts
Most research on author identification considers large texts. Little research has been done on author identification for short texts, even though short texts have become common since the rise of digital media. The anonymous nature of internet applications makes it possible to use the internet for illegitimate purposes. In these cases, it can be very useful to be able to predict who the author of a mess...
Cubic-time Parsing and Learning Algorithms for Grammatical Bigram Models
This technical report presents a probabilistic model of English grammar that is based upon “grammatical bigrams”, i.e., syntactic relationships between pairs of words. Because of its simplicity, the grammatical bigram model admits cubic-time parsing and unsupervised learning algorithms, which are described in detail.
Discarding impossible events from statistical language models
This paper describes a method for detecting impossible bigrams from a space of V² bigrams, where V is the size of the vocabulary. The idea is to discard all the ungrammatical events that are impossible in well-written text and consequently to expect an improvement of the language model. We also expect, in speech recognition, to reduce the complexity of the search algorithm by making less com...
A Voice Dictation System for a Million-Word Czech Vocabulary
The paper describes a set of techniques developed for discrete dictation within a vocabulary that contains up to a million entries, which is one of the main challenges in highly inflected languages like Czech. We present our approach to building an efficiently coded tree lexicon with suffix sub-trees and morphologic classification. Acoustic modeling is based on either monophone, diphone, or tri...